As a first step in the analysis of treatment effects, we will look at differential expression with LIMMA. Before fitting, we will use SVA to learn usable representations of the batch effects in the data. The resulting vectors can be fed straight to LIMMA to correct for confounders. At first we will keep treatments separate, but once we have an understanding of the similarity between treatments, we will regroup them into contrasts and look for common effects.
Metadata for this report are available in:
yeast_phospho_data_processed/site_data_v2/meta_runs_samples.csv
Uncorrected imputed data can be found in:
yeast_phospho_data_processed/site_data_v2/imputation/data_psites.vfilt.50.qrilc.rep1.csv
The SVA algorithm is set up to find effects in the data that cannot be explained by the treatments. This allows us to go beyond the defined batches and find effects that vary within a batch as well. Once we have these variables, we will include them directly in our design matrix.
## Number of significant surrogate variables is: 4
## Iteration (out of 5 ):1 2 3 4 5
From this graph, it is apparent that all 4 SVs are picking up major batch effects in the data. In addition, there is substantial spread of the variables within each batch, which we believe implies that SVA is capturing information at a finer grain than batch alone.
Given that SVA works a lot like PCA, we can see it picking up similar batch separation. This plot doesn’t give more information than the previous one; it is just a different way of visualizing it.
Sorting the runs by run order clearly demonstrates the additional effects SVA is pulling out. Much of this is likely due to the amount of imputation that is occurring per file, but that is fine. The main thing is that we did not have to specify these effects, yet they are identified anyway.
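To make concrete what including the surrogate variables directly in the design matrix means, here is a minimal Python sketch (the report itself was produced in R; Python is used here only for illustration). It builds one row per run with an intercept, one indicator column per non-reference treatment, and one column per surrogate variable. The function name `build_design`, the treatment labels, and the SV values are all hypothetical and are not taken from the analysis.

```python
def build_design(treatments, surrogate_vars, reference):
    """Build a design matrix: intercept + treatment indicators + SV columns.

    treatments     : list of treatment labels, one per run
    surrogate_vars : list of per-run lists of surrogate variable values
    reference      : label of the control treatment (absorbed by the intercept)
    """
    if len(treatments) != len(surrogate_vars):
        raise ValueError("need one row of surrogate variables per run")
    levels = sorted(set(treatments) - {reference})
    n_sv = len(surrogate_vars[0]) if surrogate_vars else 0
    names = ["Intercept"] + [f"trt_{lvl}" for lvl in levels]
    names += [f"SV{i + 1}" for i in range(n_sv)]
    rows = []
    for trt, svs in zip(treatments, surrogate_vars):
        # One indicator per non-reference treatment level.
        indicators = [1.0 if trt == lvl else 0.0 for lvl in levels]
        rows.append([1.0] + indicators + [float(v) for v in svs])
    return names, rows


# Hypothetical example: 4 runs, 2 treatments plus a control, 2 SVs.
treatments = ["control", "A", "B", "A"]
surrogate_vars = [[0.1, -0.3], [0.2, 0.0], [-0.4, 0.5], [0.0, 0.1]]
names, design = build_design(treatments, surrogate_vars, reference="control")
```

Because the SV columns sit alongside the treatment indicators, the treatment coefficients in such a model are estimated after adjusting for the surrogate variables.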
The standard in our lab for differential expression is LIMMA. While we tend to use it for standard differential expression against some control, the most useful part about it will be the ability to test custom contrasts. Many of the treatments we used are directly related to each other and can be grouped together. This may reveal common effects that are different from other groupings and increase power. To start, we will look at standard differential expression and then determine what it can say about treatment similarity.
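As an illustration of what a custom contrast over grouped treatments could look like, the following Python sketch (Python is used only for illustration; the analysis itself is in R) builds a weight vector that compares the average coefficient of one group of treatments with the average of another. The treatment names, groupings, and coefficient values are hypothetical; which treatments belong together is exactly what the analysis still has to establish.

```python
def group_contrast(coef_names, group_a, group_b):
    """Weights for mean(group_a coefficients) - mean(group_b coefficients)."""
    group_a, group_b = set(group_a), set(group_b)
    if group_a & group_b:
        raise ValueError("a treatment cannot be in both groups")
    missing = (group_a | group_b) - set(coef_names)
    if missing:
        raise ValueError(f"unknown coefficients: {sorted(missing)}")
    weights = []
    for name in coef_names:
        if name in group_a:
            weights.append(1.0 / len(group_a))
        elif name in group_b:
            weights.append(-1.0 / len(group_b))
        else:
            weights.append(0.0)  # treatments outside both groups are ignored
    return weights


def contrast_estimate(weights, coefficients):
    """Apply contrast weights to one site's fitted treatment coefficients."""
    return sum(w * c for w, c in zip(weights, coefficients))


# Hypothetical coefficient names and per-site log fold changes vs. control.
coef_names = ["trt_A", "trt_B", "trt_C", "trt_D", "trt_E"]
weights = group_contrast(coef_names, ["trt_A", "trt_B"], ["trt_C", "trt_D"])
site_coefficients = [-1.0, -2.0, 0.5, 0.0, 3.0]
estimate = contrast_estimate(weights, site_coefficients)
```

The weights sum to zero, so the contrast tests a difference between the two groups rather than an overall shift from control.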
## # A tibble: 1 × 1
## `median(improvement)`
## <dbl>
## 1 25.0
## # A tibble: 3 × 2
## label `median(variance)`
## <fct> <dbl>
## 1 Treatments 0.243
## 2 SVs 0.389
## 3 Full Model 0.614
In the table above, the surrogate variables account for more of the variance than the treatments do (median 0.389 versus 0.243), which is good. There is definitely a leftover effect from the treatments, and that is what we will be interested in next.
The P-value distribution looks good, so results have been written to:
output/limma_diffential_expression_results.csv
This volcano plot still seems a little off. There are large effects which have very low q-values, which makes the volcano plot look very flat. We will carry on for now and revisit the testing later to determine if more can be done.
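The report does not state how the q-values were computed. As a hedged illustration, the sketch below assumes Benjamini-Hochberg adjustment, a common choice for multiple-testing correction, and shows how adjusted values and volcano-plot coordinates could be derived from raw p-values. It is written in Python for illustration only; the function names and input numbers are hypothetical.

```python
import math


def benjamini_hochberg(pvalues):
    """Return Benjamini-Hochberg adjusted p-values in the input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so monotonicity can be enforced.
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, pvalues[idx] * m / rank)
        adjusted[idx] = running_min
    return adjusted


def volcano_points(log_fold_changes, qvalues):
    """Pair each effect size with -log10(q); assumes every q is above zero."""
    return [(lfc, -math.log10(q)) for lfc, q in zip(log_fold_changes, qvalues)]


# Hypothetical raw p-values and log fold changes for four sites.
pvalues = [0.01, 0.04, 0.03, 0.005]
log_fold_changes = [-2.0, 0.3, -0.8, -3.1]
qvalues = benjamini_hochberg(pvalues)
points = volcano_points(log_fold_changes, qvalues)
```

Plotting the second element of each point against the first gives the usual volcano layout, which makes it easy to check how effect size and adjusted significance relate.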
Full data counts:
## # A tibble: 1 × 3
## nTreatmentSitePairs nSites nProteins
## <int> <int> <int>
## 1 529400 5294 1390
Regulated counts:
## # A tibble: 1 × 3
## nTreatmentSitePairs nSites nProteins
## <int> <int> <int>
## 1 25182 4477 1281
Regulated 5 min counts:
## # A tibble: 1 × 3
## nTreatmentSitePairs nSites nProteins
## <int> <int> <int>
## 1 21954 4039 1202
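The count tables above can be thought of as distinct counts over a long-format results table with one row per treatment and site. The Python sketch below shows that bookkeeping (Python is used for illustration only). The column names, the q-value cutoff of 0.05, and the example rows are assumptions; the report does not state the criterion used to call a site regulated.

```python
def count_summary(rows):
    """Count distinct treatment-site pairs, sites, and proteins."""
    return {
        "nTreatmentSitePairs": len({(r["treatment"], r["site"]) for r in rows}),
        "nSites": len({r["site"] for r in rows}),
        "nProteins": len({r["protein"] for r in rows}),
    }


def regulated(rows, q_cutoff=0.05):
    """Keep rows passing a hypothetical q-value cutoff."""
    return [r for r in rows if r["q"] < q_cutoff]


# Hypothetical long-format results: one row per treatment and site.
results = [
    {"treatment": "A", "site": "P1_S10", "protein": "P1", "q": 0.001},
    {"treatment": "A", "site": "P1_S22", "protein": "P1", "q": 0.300},
    {"treatment": "A", "site": "P2_T5", "protein": "P2", "q": 0.040},
    {"treatment": "B", "site": "P1_S10", "protein": "P1", "q": 0.020},
    {"treatment": "B", "site": "P1_S22", "protein": "P1", "q": 0.700},
    {"treatment": "B", "site": "P2_T5", "protein": "P2", "q": 0.900},
]
full_counts = count_summary(results)
regulated_counts = count_summary(regulated(results))
```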
## Warning: Removed 1 rows containing missing values (position_stack).
## Removed 1 rows containing missing values (position_stack).
## Removed 1 rows containing missing values (position_stack).
As we have seen in other analyses, the SP condition clearly stands out from the rest.
## Warning: Removed 1 rows containing missing values (geom_point).
Looking at the above plot, it is clear that almost all of the regulatory effect is downregulation. There is definitely some upregulation present, but it makes up a minority of the effect.
There was some worry that the overabundance of downregulation could be an artifact of imputation. However, the over-representation of downregulation relative to upregulation is anti-correlated with imputation, which seems to imply that this is a real effect of highly expressed sites.
## Warning: Removed 15 rows containing missing values (geom_smooth).
## Removed 15 rows containing missing values (geom_smooth).
## Warning: Removed 24 rows containing missing values (geom_smooth).
## Removed 24 rows containing missing values (geom_smooth).
## Warning: Removed 59 rows containing missing values (geom_smooth).
## Removed 59 rows containing missing values (geom_smooth).
The above plots show that there is a break point in the number of treatments in which sites are regulated. This seems to imply that some sites are more prone to being regulated in many treatments, possibly because of their importance.
The above plots give a great view into the overlap between conditions. To produce them, we simply counted the number of overlapping significant coefficients between conditions. One thing that is apparent is that a large portion of treatments show little regulation. There is a good chance that the coefficients discovered for them are false positives, but further investigation is warranted to determine whether there are any robust but unique effects. The other large portion of treatments have high overlap with each other. There are at least two apparent clusters, with distinct patterns of connection between them. In the next section, I will try to use certain subsets of the data to tease apart effects.
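The overlap counting described above can be sketched as follows. This Python example (used for illustration only; the analysis itself is in R) assumes each condition has already been reduced to the set of sites with a significant coefficient. The condition names and site identifiers are hypothetical.

```python
from itertools import combinations


def pairwise_overlap(significant_sites):
    """Count shared significant sites for every pair of conditions."""
    conditions = sorted(significant_sites)
    return {
        (a, b): len(significant_sites[a] & significant_sites[b])
        for a, b in combinations(conditions, 2)
    }


# Hypothetical sets of significant sites per condition.
significant_sites = {
    "A": {"s1", "s2", "s3", "s4"},
    "B": {"s2", "s3", "s4", "s5"},
    "C": {"s9"},
}
overlaps = pairwise_overlap(significant_sites)
```

Conditions with few significant sites, like `C` here, necessarily show small overlaps, so normalizing by set size (for example with a Jaccard index) may be worth considering when comparing conditions.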